

Section: Research Program

Modelling and Development of Language Resources

Construction, management and automatic annotation of Text Corpora

Corpus creation and management (including automatic annotation) is often a time-consuming and technically challenging task. In many cases, it also raises scientific issues related, for instance, to linguistic questions (what is the elementary unit in a text?) as well as to computer-science challenges (for instance when OCR or HTR is involved). It is therefore necessary to design a workflow that makes it possible to deal with data collections even when they are initially available as photos, scans, Wikipedia dumps, etc.

These challenges are particularly relevant when dealing with ancient languages or scripts, for which fonts, OCR techniques and language models may be missing or of inferior quality, as a result, among other factors, of the variety of writing systems and the scarcity of textual data. The project-team will therefore work on improving print OCR for some of these languages (e.g. Syriac, Ge'ez, Armenian). When an ancient source is still unpublished (book, manuscript, stele, tablet...), and therefore only available in raw (image) form, we intend to develop OCR/HTR techniques, at least for certain scripts (Hebrew, Coptic and Greek uncials, Ge'ez), and to construct a pipeline for historical manuscripts. Initial results for Hebrew and Latin manuscripts have been very encouraging (ca. 3% CER). On the one hand, access to existing electronic corpora, especially epigraphic corpora (e.g. Aramaic, North and South Arabian), still has to be negotiated. On the other hand, data that has been produced directly in electronic form (e.g. on social media) is readily usable, but far from normalised. Contemporary texts can often be gathered in very large volumes, as we already do within the ANR project SoSweet, but this raises specific issues of its own.
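
For reference, the character error rate (CER) mentioned above is a length-normalised edit distance between the recognised text and the ground-truth transcription. The following sketch (plain Python, purely illustrative) shows one standard way to compute it:

    def cer(reference: str, hypothesis: str) -> float:
        """Character Error Rate: Levenshtein distance between the OCR/HTR
        output and the reference transcription, normalised by the
        reference length."""
        # Standard dynamic-programming edit distance over characters,
        # kept in a single row to save memory.
        m, n = len(reference), len(hypothesis)
        dp = list(range(n + 1))
        for i in range(1, m + 1):
            prev, dp[0] = dp[0], i
            for j in range(1, n + 1):
                cur = dp[j]
                cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
                dp[j] = min(dp[j] + 1,      # deletion
                            dp[j - 1] + 1,  # insertion
                            prev + cost)    # substitution or match
                prev = cur
        return dp[n] / max(m, 1)

    # A CER of ca. 0.03 (3%) means about 3 edits per 100 reference characters.
    print(cer("manuscript", "manuscripf"))  # -> 0.1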

An inventory of the resources already developed or used by ALMAnaCH members has been compiled.

The team pays specific attention to the re-usability of all resources produced and maintained within its various projects and research activities (more broadly, we intend to conform to the so-called FAIR principles, http://force11.org/group/fairgroup/fairprinciples). To this end, we want to ensure maximum compatibility with the available international standards for representing textual sources and their annotations. More precisely, we consider the TEI guidelines, as well as the standards produced by ISO committee TC 37/SC 4, as essential points of reference.
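
As a concrete illustration of such compatibility, the sketch below (Python standard library only; the element inventory is deliberately minimal and not schema-complete, since a valid fileDesc also requires publicationStmt and sourceDesc) wraps raw text in a TEI skeleton:

    import xml.etree.ElementTree as ET

    TEI = "http://www.tei-c.org/ns/1.0"
    ET.register_namespace("", TEI)  # serialise TEI as the default namespace

    def tei_skeleton(title: str, paragraphs: list) -> ET.Element:
        """Wrap raw text in a minimal TEI document: a teiHeader carrying
        the title, and one <p> per input paragraph under <text>/<body>."""
        tei = ET.Element(f"{{{TEI}}}TEI")
        header = ET.SubElement(tei, f"{{{TEI}}}teiHeader")
        file_desc = ET.SubElement(header, f"{{{TEI}}}fileDesc")
        title_stmt = ET.SubElement(file_desc, f"{{{TEI}}}titleStmt")
        ET.SubElement(title_stmt, f"{{{TEI}}}title").text = title
        body = ET.SubElement(ET.SubElement(tei, f"{{{TEI}}}text"),
                             f"{{{TEI}}}body")
        for para in paragraphs:
            ET.SubElement(body, f"{{{TEI}}}p").text = para
        return tei

    doc = tei_skeleton("Sample corpus item", ["First paragraph.", "Second one."])
    print(ET.tostring(doc, encoding="unicode"))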

From our ongoing projects in the field of Digital Humanities and from emerging initiatives in this field, we observe a real need for complete yet easy-to-use workflows for exploiting corpora, starting from a set of raw documents and reaching the level where one can browse the main concepts and entities, explore their relationships and extract specific pieces of information, always with the ability to return to (fragments of) the original documents. The process may be seen as progressively enriching the documents with new layers of annotations produced by various NLP modules and possibly validated by users, preferably in a collaborative way. It relies on the use of clearly identified representation formats for the annotations, as advocated by ISO TC 37/SC 4 and the TEI, but also on the existence of well-designed collaborative interfaces for browsing and validation. ALMAnaCH has worked or is working on several of the NLP bricks needed to set up such a workflow, and has solid expertise in the related standardisation issues (for both documents and annotations). However, putting all these elements together in a unified workflow that is simple to deploy and configure remains to be done.
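
The layered-enrichment idea can be sketched as follows; all names are hypothetical and the "NLP module" is a toy regex-based stand-in, but the structure (immutable source text, stand-off layers with provenance and a validation flag for the collaborative interface) is the point:

    import re
    from dataclasses import dataclass, field

    @dataclass
    class Annotation:
        start: int        # character offset into the source text
        end: int
        label: str        # e.g. a POS tag or an entity type
        producer: str     # NLP module (or human) that created it
        validated: bool = False  # set by a user in the collaborative UI

    @dataclass
    class Document:
        text: str         # the raw source text is never modified
        layers: dict = field(default_factory=dict)  # layer name -> annotations

        def enrich(self, layer: str, module, producer: str) -> None:
            """Run one NLP module and store its output as a stand-off layer."""
            self.layers[layer] = [Annotation(s, e, lab, producer)
                                  for (s, e, lab) in module(self.text)]

    # Toy module: tag capitalised words as candidate entities.
    def toy_ner(text):
        return [(m.start(), m.end(), "ENT")
                for m in re.finditer(r"\b[A-Z][a-z]+\b", text)]

    doc = Document("Basnage published the Dictionnaire Universel.")
    doc.enrich("entities", toy_ner, producer="toy_ner-0.1")
    print(doc.layers["entities"][0])  # Annotation(start=0, end=7, ...)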

It should be noted that such workflows also have a large potential beyond DH, for instance for valorising a company's internal documentation or for exploring existing relationships between entities. (In this regard, we have started preliminary discussions with Fujitsu Lab and with the International Consortium of Investigative Journalists.)

Development of Lexical Resources

ALPAGE, the Inria predecessor of ALMAnaCH, put a strong emphasis on the development of morphological, syntactic and wordnet-like semantic lexical resources for French as well as other languages (see for instance [4], [1]). Such resources play a crucial role in all NLP tools, as has been shown for tasks such as POS tagging [71], [25], [29] and parsing, and some of our lexical resource development is specifically targeted at improving NLP tools. These resources also play a central role in studying diachrony in the lexicon, for example from Ancient to Contemporary French in the context of the Profiterole project. They are also one of the primary sources of linguistic information for augmenting the language models used in OCR systems for ancient scripts, and they allow us to develop automatic annotation tools (e.g. POS taggers) for low-resourced languages, especially ancient ones (see already [30]). Finally, semantic lexicons such as wordnets play a crucial role in assessing lexical similarity and automating etymological research.
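
To make the contribution of lexicons to tagging concrete, the sketch below shows the kind of features an external morphological lexicon can feed into a statistical POS tagger (the helper and the toy lexicon are illustrative, not an actual ALMAnaCH tool): the lexicon constrains the candidate tags, while the model disambiguates in context.

    def lexicon_features(token: str, lexicon: dict) -> dict:
        """Features contributed by an external morphological lexicon,
        which maps a word form to the set of POS tags it allows; the
        tagger's statistical model then disambiguates in context."""
        tags = lexicon.get(token.lower(), set())
        feats = {"in_lexicon": bool(tags),
                 "lex_ambiguity": len(tags)}
        for tag in sorted(tags):
            feats[f"lex_tag={tag}"] = True
        return feats

    toy_lexicon = {"porte": {"NOUN", "VERB"}, "la": {"DET", "PRON"}}
    print(lexicon_features("porte", toy_lexicon))
    # {'in_lexicon': True, 'lex_ambiguity': 2,
    #  'lex_tag=NOUN': True, 'lex_tag=VERB': True}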

We therefore intend to make an important effort towards the development of new morphological lexicons, with a focus on the ancient languages of interest. Following previous work by ALMAnaCH members, we try to leverage all existing resources whenever possible, such as electronic dictionaries and OCRised dictionaries, both modern and ancient [70], [19], [26], [27], while using and developing (semi-)automatic lexical information extraction techniques based on existing corpora [69], [74]. A new line of research consists in integrating the diachronic axis by linking lexicons that stand in diachronic relation with one another through phonetic and morphological change laws (e.g. 12th-century French with 15th-century and contemporary French). Another novelty is the integration of etymological information into these lexical resources, which requires the formalisation, standardisation and extraction of etymological information from OCRised dictionaries or other electronic resources, as well as the automatic generation of candidate etymologies. These directions of research are already being investigated in ALMAnaCH [19], [26], [27].
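
The diachronic-linking idea can be sketched as the ordered application of change laws to older forms, keeping a link whenever the modernised candidate is attested in the contemporary lexicon. The rules below are deliberately simplified toys, not actual phonetic laws:

    import re

    # Toy orthographic "change laws" (pattern, replacement), applied in order.
    RULES = [
        (r"^es(?=[pt])", "é"),  # espee -> épee ; estat -> état
        (r"ee$", "ée"),         # épee  -> épée
    ]

    def modernise(old_form: str) -> str:
        """Apply the ordered rules to an older form to produce a
        candidate contemporary spelling."""
        form = old_form
        for pattern, repl in RULES:
            form = re.sub(pattern, repl, form)
        return form

    def link(old_lexicon, modern_lexicon):
        """Propose (old, modern) pairs when the modernised form is attested."""
        return [(w, modernise(w)) for w in old_lexicon
                if modernise(w) in modern_lexicon]

    print(link(["espee", "estat"], {"épée", "état", "cheval"}))
    # [('espee', 'épée'), ('estat', 'état')]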

An underlying effort of this research is the further development of the GROBID-dictionaries software, which provides cascading CRF (Conditional Random Fields) models for the segmentation and analysis of existing print dictionaries. The first results we have obtained have allowed us to set up specific collaborations to improve our performance in the domains of (a) recent general-purpose dictionaries such as the Petit Larousse (Nénufar project, funded by the DGLFLF in collaboration with the University of Montpellier), (b) etymological dictionaries (in collaboration with the Berlin-Brandenburg Academy of Sciences) and (c) patrimonial dictionaries such as the Dictionnaire Universel de Basnage (an ANR project in preparation with the University of Grenoble-Alpes and Paris Sorbonne Nouvelle).
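
GROBID-dictionaries itself is a Java tool, but the principle behind its first cascade level, labelling the token stream of a digitised entry with field labels, can be sketched with the third-party sklearn-crfsuite package (feature names and the toy data are purely illustrative):

    import sklearn_crfsuite  # pip install sklearn-crfsuite

    def token_features(tokens, i):
        """Surface features of the kind print-dictionary layout provides."""
        tok = tokens[i]
        return {"lower": tok.lower(),
                "all_caps": tok.isupper(),  # stand-in for real font features
                "is_punct": not tok.isalnum(),
                "position": i,
                "BOS": i == 0}

    def featurise(seq):
        return [token_features(seq, i) for i in range(len(seq))]

    # Tiny toy training set: one entry segmented into lemma/POS/sense fields.
    train_tokens = [["NÉNUFAR", "n.", "m.", "Plante", "aquatique", "."]]
    train_labels = [["LEMMA", "POS", "POS", "SENSE", "SENSE", "SENSE"]]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
    crf.fit([featurise(s) for s in train_tokens], train_labels)
    print(crf.predict([featurise(["LOTUS", "n.", "m.", "Autre", "plante", "."])]))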

Just as we signalled the importance of standards for the representation of interoperable corpora and their annotations, we intend to keep making the best of the existing standardisation background for the representation of our various lexical resources. There again, the TEI guidelines play a central role, and we have recently participated in the “TEI Lex-0” initiative, which aims to provide a reference subset of the “Dictionaries” chapter of the guidelines. We are also responsible, as project leader, for the editing of the new part 4 of the ISO standard 24613 (LMF, Lexical Markup Framework), dedicated to the definition of a TEI serialisation of the LMF model (defined in ISO 24613 parts 1 (core model), 2 (machine-readable dictionaries) and 3 (etymology)). We consider that contributing to standards allows us to stabilise our knowledge and transfer our competence.
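
By way of illustration, a lexical entry with etymological information serialised along these lines might look as follows; the element choice is in the spirit of TEI Lex-0 but is not a normative excerpt of either TEI or ISO 24613:

    import xml.etree.ElementTree as ET

    ENTRY = """
    <entry xmlns="http://www.tei-c.org/ns/1.0" xml:id="nenufar">
      <form type="lemma"><orth>nénufar</orth></form>
      <gramGrp><pos>noun</pos><gen>m</gen></gramGrp>
      <sense><def>Plante aquatique.</def></sense>
      <etym>Borrowed, via Latin, from Arabic nīnūfar.</etym>
    </entry>
    """

    elem = ET.fromstring(ENTRY)  # checks well-formedness
    print(elem.find("{http://www.tei-c.org/ns/1.0}etym").text)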

Development of Annotated Corpora

Along with the creation of lexical resources, ALMAnaCH is also involved in the creation of corpora that are either fully manually annotated (gold standard) or automatically annotated with state-of-the-art processing pipelines (silver standard). Annotations are either purely morphosyntactic or cover more complex linguistic levels (constituency and/or dependency syntax, deep syntax, possibly semantics). Former members of the ALPAGE project have renowned experience in these matters (see for instance [76], [65], [75], [61]) and now participate in the creation of valuable resources for historical text genres.
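
For silver-standard morphosyntactic layers, the de facto exchange format is CoNLL-U; the sketch below (hypothetical helper, toy Old French example, unfilled columns left as "_") shows the ten-column serialisation of one automatically tagged sentence:

    def to_conllu(sentence_id: str, tokens, upos_tags) -> str:
        """Serialise one automatically tagged sentence as a CoNLL-U block
        (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC)."""
        lines = [f"# sent_id = {sentence_id}",
                 "# text = " + " ".join(tokens)]
        for i, (tok, tag) in enumerate(zip(tokens, upos_tags), start=1):
            lines.append("\t".join([str(i), tok, "_", tag] + ["_"] * 6))
        return "\n".join(lines) + "\n"

    print(to_conllu("silver-0001",
                    ["Ceste", "chose", "est", "voire", "."],
                    ["DET", "NOUN", "AUX", "ADJ", "PUNCT"]))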